1. Extracting tweets from twitter
- Fields such as tweet content, tweet date, retweet count, tweet language, user location, user followers, and user verified status are used
2. Storing these tweets in MongoDB using Python MongoDB connector
3. Connecting to MongoDB from Python
4. Analysing (pre-processing and transforming) the tweets in Python using PySpark and generating word clouds on the text data
5. Exporting the summarised data to Google Drive using an API
6. Importing the data from Google Drive and,
7. Visualising the data in Tableau
Note: The entire pipeline is implemented on a Jetstream VM
The main reason behind choosing this topic is that most of us are directly or indirectly affected by Covid, and a few reasons for the severe spread of the virus have been a) unavailability of vaccines, b) reluctance to get vaccinated, c) improper masking techniques, and d) ineffective border control. So, I focus on part b, i.e., the reluctance of people to get vaccinated, by understanding the sentiment of people towards various vaccines on the social media platform Twitter.
Although the Covid vaccines have proven quite effective (efficacy rates of ~90-95% in clinical trials) in dealing with the virus, there has been a lot of misunderstanding, scepticism, and stigma in the public around these vaccines because of the spread of misinformation. This has resulted in lower vaccination rates in some countries, causing increased spread of the virus. So, it is extremely important to understand the sentiment towards vaccines among people and create awareness to contain the spread of the virus.
In this project, I primarily analyse the sentiment for different vaccines to see which vaccines people trust the most, and how the sentiment has evolved with vaccination rates, to understand whether vaccinations lead to positive sentiment.
I have used MongoDB to store the tweets and the world vaccination data. I created a free shared cluster in the Iowa region with a storage limit of 512 MB.
MongoDB is a resilient distributed storage system where the data is stored on different cluster replicas (1 primary and 2 secondary in this case)
A PySpark VM instance is launched on Jetstream for the purpose of this analysis, and the necessary packages, like Jupyter Notebook and Py4J, are installed.
Specs: m1.quad (4 CPUs, 10 GB memory, and 20 GB disk)
1. The tweets are extracted from Twitter using snscrape for each of the vaccines in the form of JSON objects. These tweets are stored in a local folder and are later written to MongoDB.
2. The code below shows how tweets are extracted for the 'moderna' keyword between 1/1/2022 and 4/30/2022, with a maximum of 25,000 of the latest tweets. Similarly, tweets are extracted for the Covaxin and Pfizer vaccines.
import os
os.system("snscrape --jsonl --max-results 25000 --since 2022-01-01 twitter-search 'moderna until:2022-04-30' > /home/svallur/bigdata/moderna_tweets.json")
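The same command can be generalized to all three keywords. A minimal sketch (the helper `build_scrape_command` is mine, not part of the original notebook; the date range and output directory mirror the command above, and actually running it requires snscrape on the PATH):

```python
KEYWORDS = ["moderna", "covaxin", "pfizer"]

def build_scrape_command(keyword, since="2022-01-01", until="2022-04-30",
                         max_results=25000, out_dir="/home/svallur/bigdata"):
    # Mirrors the snscrape invocation above for a given vaccine keyword.
    return (
        f"snscrape --jsonl --max-results {max_results} --since {since} "
        f"twitter-search '{keyword} until:{until}' "
        f"> {out_dir}/{keyword}_tweets.json"
    )

commands = [build_scrape_command(kw) for kw in KEYWORDS]
# for cmd in commands: os.system(cmd)  # uncomment (and import os) to run
```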
The world vaccination data is extracted from https://ourworldindata.org in the form of a CSV. This contains the daily vaccination statistics for each country.
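Reading that CSV and keeping only the relevant columns can be sketched as follows (the local path and the `load_vaccinations` helper are assumptions; the column names `location`, `date`, and `daily_vaccinations` are the ones OWID typically uses, but should be verified against the downloaded file):

```python
import pandas as pd

# Hypothetical local path where the OWID export is assumed to be saved.
VACC_CSV = "/home/svallur/bigdata/vaccinations.csv"

def load_vaccinations(path, columns=("location", "date", "daily_vaccinations")):
    # Keep only the columns needed for the sentiment-vs-vaccination comparison.
    return pd.read_csv(path, usecols=list(columns), parse_dates=["date"])
```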
The extracted tweet JSON data is written to the Covid_tweets database in MongoDB using MongoClient.
The image below shows the Covid_tweets database after all the relevant data is inserted.
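The snscrape output is one JSON object per line, so loading it for insertion can be sketched as below (the `parse_tweets_jsonl` helper is mine; the `insert_many` call is shown commented out since it needs a live cluster):

```python
import json

def parse_tweets_jsonl(lines):
    # snscrape --jsonl emits one tweet object per line; skip blank lines.
    return [json.loads(line) for line in lines if line.strip()]

# Usage against the scraped file and a live cluster (not run here):
# import pymongo
# client = pymongo.MongoClient("<connection string>")
# with open("/home/svallur/bigdata/moderna_tweets.json") as f:
#     client["Covid_tweets"]["moderna"].insert_many(parse_tweets_jsonl(f))
```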
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("svallur") \
    .getOrCreate()
import pymongo
import pandas as pd
client = pymongo.MongoClient("mongodb+srv://svallur:11!!aaAA@cluster0.lf8mp.mongodb.net/Covid_tweets?retryWrites=true&w=majority")
The tweet data for each vaccine is read into Python, keeping only the relevant fields and ignoring the rest of the columns in the original data. The same steps are repeated for all three vaccines.
These datasets are converted to Spark DataFrames in the subsequent steps.
db = client["Covid_tweets"]
col = db["covaxin"]
x = col.find({})
covaxin_dict = []
for i in x:
    tweet_dict = {}
    tweet_dict['date'] = i['date']
    tweet_dict['content'] = i['content']
    tweet_dict['keyword'] = 'covaxin'
    tweet_dict['retweetcount'] = i['retweetCount']
    tweet_dict['lang'] = i['lang']
    tweet_dict['verified'] = i['user']['verified']
    tweet_dict['followersCount'] = i['user']['followersCount']
    tweet_dict['location'] = i['user']['location']
    tweet_dict['friendsCount'] = i['user']['friendsCount']
    covaxin_dict.append(tweet_dict)
# Convert the list of tweet dicts to a Spark DataFrame
covaxin_data = spark.createDataFrame(covaxin_dict)
covaxin_data.printSchema()
covaxin_data.show()
root
 |-- content: string (nullable = true)
 |-- date: string (nullable = true)
 |-- followersCount: long (nullable = true)
 |-- friendsCount: long (nullable = true)
 |-- keyword: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- location: string (nullable = true)
 |-- retweetcount: long (nullable = true)
 |-- verified: boolean (nullable = true)
(sample rows of the covaxin tweets DataFrame truncated here; the original output showed the top 20 rows)
db = client["Covid_tweets"]
col = db["pfizer"]
x = col.find({})
pfizer_dict = []
for i in x:
    tweet_dict = {}
    tweet_dict['date'] = i['date']
    tweet_dict['content'] = i['content']
    tweet_dict['keyword'] = 'pfizer'
    tweet_dict['retweetcount'] = i['retweetCount']
    tweet_dict['lang'] = i['lang']
    tweet_dict['verified'] = i['user']['verified']
    tweet_dict['followersCount'] = i['user']['followersCount']
    tweet_dict['location'] = i['user']['location']
    tweet_dict['friendsCount'] = i['user']['friendsCount']
    pfizer_dict.append(tweet_dict)
# Convert the list of tweet dicts to a Spark DataFrame
pfizer_data = spark.createDataFrame(pfizer_dict)
pfizer_data.printSchema()
pfizer_data.show()
root
 |-- content: string (nullable = true)
 |-- date: string (nullable = true)
 |-- followersCount: long (nullable = true)
 |-- friendsCount: long (nullable = true)
 |-- keyword: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- location: string (nullable = true)
 |-- retweetcount: long (nullable = true)
 |-- verified: boolean (nullable = true)
(sample rows of the pfizer tweets DataFrame truncated here; the original output showed the top 20 rows)
pfizer_data.count()
25000
db = client["Covid_tweets"]
col = db["moderna"]
x = col.find({})
moderna_dict = []
for i in x:
    tweet_dict = {}
    tweet_dict['date'] = i['date']
    tweet_dict['content'] = i['content']
    tweet_dict['keyword'] = 'moderna'
    tweet_dict['retweetcount'] = i['retweetCount']
    tweet_dict['lang'] = i['lang']
    tweet_dict['verified'] = i['user']['verified']
    tweet_dict['followersCount'] = i['user']['followersCount']
    tweet_dict['location'] = i['user']['location']
    tweet_dict['friendsCount'] = i['user']['friendsCount']
    moderna_dict.append(tweet_dict)
# Convert the list of tweet dicts to a Spark DataFrame
moderna_data = spark.createDataFrame(moderna_dict)
moderna_data.printSchema()
moderna_data.show()
root
 |-- content: string (nullable = true)
 |-- date: string (nullable = true)
 |-- followersCount: long (nullable = true)
 |-- friendsCount: long (nullable = true)
 |-- keyword: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- location: string (nullable = true)
 |-- retweetcount: long (nullable = true)
 |-- verified: boolean (nullable = true)
(sample rows of the moderna tweets DataFrame truncated here; the original output showed the top 20 rows)
moderna_data.count()
25000
covid_data = pfizer_data.union(covaxin_data)
covid_data = covid_data.union(moderna_data)
covid_data.count()
75000
import string
import re
def remove_punct(text):
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub(" ", text)
    return nopunct
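A quick check of what remove_punct does on a sample tweet (the function is redefined here so the snippet is self-contained; the sample text is made up):

```python
import re
import string

def remove_punct(text):
    # Same definition as above: punctuation, digits, \r \t \n become spaces.
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    return regex.sub(" ", text)

tokens = remove_punct("Got my 2nd #moderna dose today!!!").split()
# Hashtags, digits, and punctuation are gone; fragments like 'nd' remain.
```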
The tweets are cleaned to remove any punctuation marks and then passed through TextBlob to get a sentiment polarity score. This new information is added to the DataFrame.
from pyspark.sql.functions import udf
from textblob import TextBlob
sentiment = udf(lambda x: TextBlob(remove_punct(x)).sentiment[0])
spark.udf.register("sentiment", sentiment)
covid_data = covid_data.withColumn("sentiment",sentiment('content').cast('double'))
covid_data.show()
(sample rows of the combined DataFrame with the new sentiment column truncated here; the original output showed the top 20 rows)
covid_data_pandas = covid_data.toPandas()
covid_data_pandas.head()
| | content | date | followersCount | friendsCount | keyword | lang | location | retweetcount | verified | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @DrWoodcockFDA I didn’t see any information on... | 2022-04-29T23:59:27+00:00 | 100 | 181 | pfizer | en | Western Australia | 0 | False | -0.0500 |
| 1 | @ernestorr Ernesto, cómo estás, me llegó el tu... | 2022-04-29T23:59:14+00:00 | 486 | 744 | pfizer | es | CABA | 0 | False | 0.0000 |
| 2 | @jaredpolis we need your help to speed up the ... | 2022-04-29T23:58:59+00:00 | 4 | 95 | pfizer | en | | 0 | False | 0.2000 |
| 3 | @ArtysHouse @isthisnetaken @JustinTrudeau @pfi... | 2022-04-29T23:58:54+00:00 | 75 | 130 | pfizer | en | Throng la Pass | 0 | False | -0.3875 |
| 4 | Thirteen percent of the WHO’s funding ($300 mi... | 2022-04-29T23:58:51+00:00 | 73 | 244 | pfizer | en | | 0 | False | 0.0000 |
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="maps")
# geocode() takes a single query string, so look up each row's location
covid_data_pandas['lat'] = covid_data_pandas['location'].apply(
    lambda loc: geolocator.geocode(loc) or "NA")
covid_data_pandas.to_csv('/home/svallur/bigdata/covid_data.csv')
covid_data_pandas.shape
(75000, 11)
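Geocoding 75,000 rows one at a time is slow and runs into Nominatim's rate limits; since many tweets share the same user location, one option is to resolve each distinct location only once and reuse the result. A sketch of that caching pattern (the `geocode_unique` helper is mine; the geocoder is passed in as a plain function so any backend can be plugged in):

```python
def geocode_unique(locations, geocode_fn):
    # Resolve each distinct location string once and reuse the result,
    # instead of issuing one lookup per row.
    cache = {}
    out = []
    for loc in locations:
        if loc not in cache:
            hit = geocode_fn(loc)
            cache[loc] = "NA" if hit is None else hit
        out.append(cache[loc])
    return out

# Usage with geopy (not run here):
# covid_data_pandas['lat'] = geocode_unique(
#     covid_data_pandas['location'], geolocator.geocode)
```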
The Google Sheets API is used to write the summarised data to Google Sheets.
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from pprint import pprint
scope = ["https://spreadsheets.google.com/feeds",
         "https://www.googleapis.com/auth/spreadsheets",
         "https://www.googleapis.com/auth/drive.file",
         "https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/home/svallur/bigdata/client_secrets.json", scope)
client = gspread.authorize(creds)
from googleapiclient.discovery import build
service = build('sheets', 'v4', credentials=creds)
SAMPLE_RANGE_NAME = 'Covid tweets!A1:AA1000000'
def Export_Data_To_Sheets():
    response_date = service.spreadsheets().values().update(
        spreadsheetId='1PmzB4zcglH7YP24QYCOSjEZDu8FejG7WM7TAlt-PFTQ',
        valueInputOption='RAW',
        range=SAMPLE_RANGE_NAME,
        body=dict(
            majorDimension='ROWS',
            values=covid_data_pandas.T.reset_index().T.values.tolist())
    ).execute()
    print('Sheet successfully Updated')
Export_Data_To_Sheets()
Sheet successfully Updated
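The slightly cryptic `covid_data_pandas.T.reset_index().T.values.tolist()` simply turns the DataFrame into a list of rows with the column names as the first row, which is the row-major shape the Sheets API `values.update` call expects. A small illustration on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"keyword": ["pfizer", "moderna"], "sentiment": [0.2, -0.1]})

# Transpose, pull the column names in as a row, then transpose back:
rows = df.T.reset_index().T.values.tolist()
# rows[0] is the header row; the remaining entries are the data rows.
```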
The data is also written to Google Drive as a CSV file to ensure large data files are transferred properly.
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from pprint import pprint
from googleapiclient.discovery import build
scope = ["https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/home/svallur/bigdata/client_secrets.json", scope)
service = build('drive', 'v3', credentials=creds)
from googleapiclient.http import MediaFileUpload
folder_id = '1irfd8Bm7tOeEnWBgfpQ40VpBiv4arZnv'
file_name = 'covid_data.csv'
mime_type = 'text/csv'
file_metadata = {'name': file_name,
                 'parents': [folder_id]}
media = MediaFileUpload('/home/svallur/bigdata/covid_data.csv', mimetype=mime_type)
service.files().create(body=file_metadata, media_body=media).execute()
{'kind': 'drive#file',
'id': '1V0gmvMMNyabPcK_zOYHLiWv5BR7TMgbM',
'name': 'covid_data.csv',
'mimeType': 'text/csv'}
Built customised wordclouds to visualise the frequent words that appear in tweets for each of the vaccines
import numpy as np
criteria = [covid_data_pandas['sentiment'].between(-1, -0.01), covid_data_pandas['sentiment'].between(-0.01, 0.01), covid_data_pandas['sentiment'].between(0.01, 1)]
values = ['negative', 'neutral', 'positive']
covid_data_pandas['sentiment_cat'] = np.select(criteria, values, 0)
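The binning above maps TextBlob polarity scores into three categories; a standalone check on a few hand-picked scores (this sketch uses 'neutral' as the default label, whereas the notebook passes 0, which never fires because polarity is always within [-1, 1]):

```python
import numpy as np
import pandas as pd

scores = pd.Series([-0.5, 0.0, 0.3])
criteria = [scores.between(-1, -0.01),
            scores.between(-0.01, 0.01),
            scores.between(0.01, 1)]
labels = ['negative', 'neutral', 'positive']
# np.select picks the label of the first matching condition per element.
cats = np.select(criteria, labels, 'neutral')
```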
import matplotlib.pyplot as plt
import regex as re
from wordcloud import WordCloud, ImageColorGenerator
import wordninja
from spellchecker import SpellChecker
from collections import Counter
import nltk
import math
import random
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words.add("amp")
[nltk_data] Downloading package wordnet to /home/svallur/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/svallur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/svallur/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
def flatten_list(l):
    return [x for y in l for x in y]

def is_acceptable(word: str):
    return word not in stop_words and len(word) > 2

# Color coding our wordclouds
def red_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl(0, 100%, {random.randint(25, 75)}%)"

def green_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl({random.randint(90, 150)}, 100%, 30%)"

def yellow_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return f"hsl(42, 100%, {random.randint(25, 50)}%)"
# Reusable function to generate word clouds
def generate_word_clouds(neg_doc, neu_doc, pos_doc):
    # Display the generated image:
    fig, axes = plt.subplots(1, 3, figsize=(20, 10))
    wordcloud_neg = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neg_doc))
    axes[0].imshow(wordcloud_neg.recolor(color_func=red_color_func, random_state=3), interpolation='bilinear')
    axes[0].set_title("Negative Words")
    axes[0].axis("off")
    wordcloud_neu = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(neu_doc))
    axes[1].imshow(wordcloud_neu.recolor(color_func=yellow_color_func, random_state=3), interpolation='bilinear')
    axes[1].set_title("Neutral Words")
    axes[1].axis("off")
    wordcloud_pos = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(" ".join(pos_doc))
    axes[2].imshow(wordcloud_pos.recolor(color_func=green_color_func, random_state=3), interpolation='bilinear')
    axes[2].set_title("Positive Words")
    axes[2].axis("off")
    plt.tight_layout()
    # plt.show();
    return fig
def get_top_percent_words(doc, percent):
    # Returns a list of the "top-n" most frequent words in a list
    top_n = int(percent * len(set(doc)))
    counter = Counter(doc).most_common(top_n)
    top_n_words = [x[0] for x in counter]
    # print(top_n_words)
    return top_n_words
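For example, with four distinct words, percent=0.5 keeps the two most frequent ones (the function is redefined here so the example runs standalone; the toy document is made up):

```python
from collections import Counter

def get_top_percent_words(doc, percent):
    # Same logic as above: keep the top `percent` of distinct words by count.
    top_n = int(percent * len(set(doc)))
    return [w for w, _ in Counter(doc).most_common(top_n)]

doc = ['covid'] * 4 + ['vaccine'] * 3 + ['dose'] * 2 + ['booster']
top = get_top_percent_words(doc, 0.5)  # 4 distinct words -> top 2
```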
def clean_document(doc):
    spell = SpellChecker()
    lemmatizer = WordNetLemmatizer()
    # Lemmatize words (needed for calculating frequencies correctly)
    doc = [lemmatizer.lemmatize(x) for x in doc]
    # Get the top 10% of all words. This may include "misspelled" words
    top_n_words = get_top_percent_words(doc, 0.1)
    # Get a list of misspelled words
    misspelled = spell.unknown(doc)
    # Accept the correctly spelled words and top_n words
    clean_words = [x for x in doc if x not in misspelled or x in top_n_words]
    # Try to split the misspelled words to generate good words (ex. "lifeisstrange" -> ["life", "is", "strange"])
    words_to_split = [x for x in doc if x in misspelled and x not in top_n_words]
    split_words = flatten_list([wordninja.split(x) for x in words_to_split])
    # Some splits may be nonsensical, so reject them ("llouis" -> ["ll", "ou", "is"])
    clean_words.extend(spell.known(split_words))
    return clean_words
def get_log_likelihood(doc1, doc2):
    doc1_counts = Counter(doc1)
    doc1_freq = {x: doc1_counts[x] / len(doc1) for x in doc1_counts}
    doc2_counts = Counter(doc2)
    doc2_freq = {x: doc2_counts[x] / len(doc2) for x in doc2_counts}
    doc_ratios = {
        # 1 is added to prevent division by 0
        x: math.log((doc1_freq[x] + 1) / (doc2_freq[x] + 1))
        for x in doc1_freq if x in doc2_freq
    }
    top_ratios = Counter(doc_ratios).most_common()
    top_percent = int(0.1 * len(top_ratios))
    return top_ratios[:top_percent]
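To see what get_log_likelihood favours, compare two toy documents: words that are relatively more frequent in the first document get the highest ratios. The `word_ratios` helper below is mine and reproduces just the ratio computation, without the top-10% truncation (note the +1 smoothing means these are rough rankings, not true log-likelihoods):

```python
import math
from collections import Counter

def word_ratios(doc1, doc2):
    # Core of get_log_likelihood above, without the top-10% truncation.
    f1 = {w: c / len(doc1) for w, c in Counter(doc1).items()}
    f2 = {w: c / len(doc2) for w, c in Counter(doc2).items()}
    return Counter({w: math.log((f1[w] + 1) / (f2[w] + 1))
                    for w in f1 if w in f2})

neg_doc = ['bad'] * 4 + ['vaccine'] * 2   # 'bad' dominates the negative doc
rest_doc = ['bad'] * 1 + ['vaccine'] * 5  # 'vaccine' dominates the rest
top_word = word_ratios(neg_doc, rest_doc).most_common(1)[0][0]
```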
# Function to generate a document based on likelihood values for words
def get_scaled_list(log_list):
    counts = [int(x[1] * 100000) for x in log_list]
    words = [x[0] for x in log_list]
    cloud = []
    for i, word in enumerate(words):
        cloud.extend([word] * counts[i])
    # Shuffle to make it more "real"
    random.shuffle(cloud)
    return cloud
def get_smart_clouds(df):
    neg_doc = flatten_list(df[df['sentiment_cat'] == 'negative']['words'])
    neg_doc = [x for x in neg_doc if is_acceptable(x)]
    pos_doc = flatten_list(df[df['sentiment_cat'] == 'positive']['words'])
    pos_doc = [x for x in pos_doc if is_acceptable(x)]
    neu_doc = flatten_list(df[df['sentiment_cat'] == 'neutral']['words'])
    neu_doc = [x for x in neu_doc if is_acceptable(x)]
    # Clean all the documents
    neg_doc_clean = clean_document(neg_doc)
    neu_doc_clean = clean_document(neu_doc)
    pos_doc_clean = clean_document(pos_doc)
    # Combine classes B and C to compare against A (ex. "positive" vs "non-positive")
    top_neg_words = get_log_likelihood(neg_doc_clean, flatten_list([pos_doc_clean, neu_doc_clean]))
    top_neu_words = get_log_likelihood(neu_doc_clean, flatten_list([pos_doc_clean, neg_doc_clean]))
    top_pos_words = get_log_likelihood(pos_doc_clean, flatten_list([neu_doc_clean, neg_doc_clean]))
    # Generate a synthetic corpus using our log-likelihood values
    neg_doc_final = get_scaled_list(top_neg_words)
    neu_doc_final = get_scaled_list(top_neu_words)
    pos_doc_final = get_scaled_list(top_pos_words)
    # Visualise our synthetic corpus
    fig = generate_word_clouds(neg_doc_final, neu_doc_final, pos_doc_final)
    return fig
# Convert string to a list of words
wordcloud_df = covid_data_pandas[covid_data_pandas.keyword == 'moderna']
wordcloud_df['words'] = wordcloud_df.content.apply(lambda x:re.findall(r'\w+', x ))
get_smart_clouds(wordcloud_df).savefig("/home/svallur/bigdata/moderna_wordclouds.png", bbox_inches="tight")
print("Wordcloud for moderna: ")
Wordcloud for moderna:
# Convert string to a list of words
wordcloud_df = covid_data_pandas[covid_data_pandas.keyword == 'pfizer']
wordcloud_df['words'] = wordcloud_df.content.apply(lambda x:re.findall(r'\w+', x ))
get_smart_clouds(wordcloud_df).savefig("/home/svallur/bigdata/pfizer_wordclouds.png", bbox_inches="tight")
print("Wordcloud for pfizer: ")
Wordcloud for pfizer:
# Convert string to a list of words
wordcloud_df = covid_data_pandas[covid_data_pandas.keyword == 'covaxin']
wordcloud_df['words'] = wordcloud_df.content.apply(lambda x:re.findall(r'\w+', x ))
get_smart_clouds(wordcloud_df).savefig("/home/svallur/bigdata/covaxin_wordclouds.png", bbox_inches="tight")
print("Wordcloud for covaxin: ")
Wordcloud for covaxin:
The summarised data uploaded to Google Sheets and Google Drive is imported into a Tableau dashboard through a live connection, and dynamic visuals are built on it.
1. Among these vaccines, Covaxin seems to have the best sentiment among people, followed by Pfizer, with Moderna slightly lagging behind Pfizer.
2. Tweets from verified profiles seem to have better (more positive) sentiment than those from non-verified profiles. This could be because misinformation is mainly spread from fake/unverified accounts.
3. Covaxin appears most frequently in tweets from users with high follower counts. This could be mainly driven by popular people from India who have massive followings on social media platforms.
4. The recent improvement in sentiment seems to be correlated with the recent increase in vaccinations. For example, there is a surge in public sentiment for Covaxin on March 30, which could be driven by increased vaccinations between March 23 and March 30 in India.
5. The word cloud for negative tweets about the Pfizer vaccine shows words like "death", "sick", and "bad", which could be due to the misinformation spread on the platform.
6. The word clouds for negative tweets about the Moderna vaccine show words like "kids", which could be due to scepticism about vaccine efficacy for children and Pfizer being the preferred vaccine for children.
From this project, what I have noticed is that the extent of vaccination and public sentiment are very likely correlated. However, the insights generated in this project may not be statistically significant given the narrow time period considered, owing to a few constraints. The correlation could be strong because, as more and more people get vaccinated, the spread of the virus gradually decreases and people become more confident in the efficacy of the vaccines. This in turn motivates more people to get vaccinated and leads to better sentiment among the masses.
There is an enormous amount of data available on the internet these days, and it is difficult for the general public to parse through misinformation and spam to get accurate information.
Next Steps: In the future, I would like to extend the scope of the project to a broader time period to look at the evolution of sentiment over the last couple of years. I would also like to include other vaccines, like Johnson & Johnson and Sputnik, in the analysis for a better understanding of public sentiment.